Overview

Obesity is a growing problem in America. The goal of my project is to examine obesity in the Philadelphia region and determine which factors influence it. I will apply statistical and machine learning analysis to the 2015 Southeastern Pennsylvania Household Health Survey results to determine which factors are the strongest predictors of obesity. You can access my GitHub repository here.

I conceptualized my project after speaking with three faculty members: Andria Johnson, Rebecca Hubbard, and Ariana Chao. Andria Johnson is a Professor in the Department of History and Sociology of Science and recommended that I look at health disparities in the local Philadelphia region. Rebecca Hubbard is an Associate Professor of Biostatistics in the Department of Biostatistics, Epidemiology and Informatics and suggested using machine learning techniques to create a model that predicts health outcomes and identifies sources of variability. Ariana Chao is an Assistant Professor of Nursing who recommended some datasets and discussed the impact of mental health on BMI and other health outcomes as a possible route.

Introduction

The obesity epidemic in America is getting worse, and current initiatives are not as effective as we’d hope. As of 2015-2016, about four in 10 U.S. adults (39.6 percent) were obese, up from 37.7 percent during 2013-2014. The news for children and teens isn’t much better: nearly 19 percent were obese in 2015-2016, up from about 17 percent during the previous two years. Obesity is a serious concern because it’s a risk factor for many health conditions, such as diabetes, heart disease, stroke, and even some kinds of cancer. Federal, state, and local health policymakers need to continue campaigns that promote good nutrition and exercise; the food and beverage industry needs to increase the supply of affordable, healthy, nutritious foods and offer fewer sugary drinks; and consumers need to demand healthier products and policies in their communities. I’m interested in looking at obesity in the Philadelphia region and used the 2015 Southeastern Pennsylvania Household Health Survey results to determine which factors are the strongest predictors of obesity, in hopes of informing health initiatives and directing efforts toward the subpopulations that are affected the most.

This problem is interdisciplinary, spanning several fields. There is a large data science component, since the analysis is performed through coding in R, and the problem itself draws insights from epidemiology, public health and policy, and sociology. Since healthcare policies are so influential, it’s vital that we have adequate data and effective analysis to inform these decisions. Datasets need to be properly cleaned and transformed, with attention to missing values. Additionally, applying context to the obtained results and drawing the correct conclusions is crucial to best represent the population.

Methods

The data I am analyzing is from the 2014-2015 Southeastern Pennsylvania Household Health Survey with results from adults and children combined. After loading in the data, variables will be converted to factor or numeric accordingly. Finally, interesting variables will be filtered and cleaned up for analysis.

library(haven)
library(tidyverse)

# read in raw data
raw.data <- read_sav(url("https://raw.githubusercontent.com/joycehelenwang/BMIN503_Final_Project/master/HS15COM1b.sav"))

# convert to factors
data <- as_factor(raw.data)

# clean up variables
data$NUMADULT <- as.numeric(as.character(data$NUMADULT))

table(data$RACE2)
## 
## White (Not-Latino) Black (Not-Latino)     Latino (total)              Asian     Biracial/Multi    Native American              Other 
##               9024               2673                829                255                349                 49                  5
levels(data$RACE2) <- c("White", "Black", "Latino", "Asian", "Multi", "Native American", "Other")

Results

Exploratory Data Analysis

First, I wanted to look at the breakdown of weight categories in the Philadelphia region. The obesity rate is about 31% for adults and about 17% for children, both below the national averages.

# obesity
obesity <- data %>%
  select(Obesity = OBESITY2, Respondant = FLAGCOM) %>%
  na.omit
ggplot(obesity, aes(Obesity)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

table(obesity)
##              Respondant
## Obesity       Adult Child
##   Underweight   135   179
##   Normal       3292  1488
##   Overweight   3499   352
##   Obese        3051   405
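As a quick check, the adult and child rates quoted above can be recovered from this table. A minimal sketch that rebuilds the printed counts by hand, so it does not depend on the obesity data frame being loaded:

```r
# rebuild the printed counts and compute column-wise proportions
counts <- matrix(c(135, 3292, 3499, 3051,
                   179, 1488, 352, 405),
                 ncol = 2,
                 dimnames = list(Obesity = c("Underweight", "Normal",
                                             "Overweight", "Obese"),
                                 Respondant = c("Adult", "Child")))
round(prop.table(counts, margin = 2), 3)  # "Obese" row: 0.306 adult, 0.167 child
```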

Then I looked at overall demographics.

## demographics
# county
county <- data %>%
  select(Obesity = OBESITY2, County = COUNTY) %>%
  na.omit
ggplot(county, aes(County)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

ggplot(county, aes(County)) +
  geom_bar(position = "fill", aes(fill = Obesity)) + 
  labs(y = "Proportion")

# income
income <- data %>%
  select(Income = INCOME) %>%
  na.omit
ggplot(income, aes(Income)) +
  geom_bar(aes(fill=Income)) +
  labs(y = "Count") +
  theme(legend.text=element_text(size=5), 
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

income2 <- data %>%
  select(Obesity = OBESITY2, Income = INCOME) %>%
  na.omit
ggplot(income2, aes(Obesity)) +
  geom_bar(position = "fill", aes(fill = Income)) +
  labs(y = "Proportion") +
  theme(legend.text=element_text(size=4))

# sex and age
age <- data %>%
  select(Obesity = OBESITY2, Age = AGE) %>%
  na.omit
ggplot(age, aes(Obesity, Age)) +
  geom_boxplot(fill = "skyblue2")

ggplot(age, aes(Age)) +
  geom_bar(aes(fill = Obesity), position = "fill") + 
  labs(y = "Proportion")

sex <- data %>%
  select(Obesity = OBESITY2, Sex = SEX) %>%
  na.omit
ggplot(sex, aes(Sex)) +
  geom_bar(aes(fill = Obesity)) + 
  labs(y = "Count")

# race
race <- data %>%
  select(Obesity = OBESITY2, Race = RACE2) %>%
  na.omit
ggplot(race, aes(Race)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

ggplot(race, aes(Race)) +
  geom_bar(aes(fill = Obesity), position = "fill") + 
  labs(y = "Proportion")

# household size
kids <- data %>%
  select(Obesity = OBESITY2, Kids = TOTKIDS) %>%
  na.omit
ggplot(kids, aes(Kids)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

adults <- data %>%
  select(Obesity = OBESITY2, Adults = NUMADULT) %>%
  na.omit
ggplot(adults, aes(Adults)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

size <- data %>%
  select(Obesity = OBESITY2, Kids = TOTKIDS, Adults = NUMADULT) %>%
  na.omit %>%
  mutate(Household = Kids + Adults)
ggplot(size, aes(Household)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

ggplot(size, aes(Obesity, Household)) +
  geom_boxplot(fill = "skyblue2") +
  labs(y = "Count")

# employment
employment <- data %>%
  select(Obesity = OBESITY2, Employment = MAINEMPL) %>%
  na.omit
ggplot(employment, aes(Employment)) +
  geom_bar(fill = "skyblue2") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(y = "Count")

ggplot(employment, aes(Employment)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(y = "Proportion") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

# education
education <- data %>%
  select(Obesity = OBESITY2, Education = RSPGRAD2) %>%
  na.omit
ggplot(education, aes(Education)) +
  geom_bar(fill = "skyblue2") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(y = "Count")

ggplot(education, aes(Education)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(y = "Proportion") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

# marital status
marital <- data %>%
  select(Obesity = OBESITY2, Marital = RESPMAR) %>%
  na.omit
ggplot(marital, aes(Marital)) +
  geom_bar(fill = "skyblue2") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(y = "Count")

ggplot(marital, aes(Marital)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(y = "Proportion") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

# sexual identity
sexident <- data %>%
  select(Obesity = OBESITY2, SexIdent = SEXIDENT) %>%
  na.omit
ggplot(sexident, aes(SexIdent)) +
  geom_bar(fill = "skyblue2") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(x = "Sexual Idendity", y = "Count")

ggplot(sexident, aes(SexIdent)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(x = "Sexual Identity", y = "Proportion") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Interesting trends I noted from the graphs:

  • Obesity is highest in Philadelphia County
  • Median income is $75K-$100K, higher than the national median of about $60K
  • Income decreases as weight category increases
  • Obesity increases with age
  • Respondents are majority white (about 70% of the sample)
  • African American respondents have the highest obesity rate and Asian respondents the lowest
  • Respondents who are unable to work, or unemployed and not looking for work, have the highest obesity rates
  • Full-time employees and students have the lowest obesity rates
  • Obesity decreases as highest education level increases

Next, I wanted to explore if access to care and insurance type influenced obesity.

library(reshape2)

# insurance rate
insurance.data <- data %>%
  select(Obesity = OBESITY2, Insured = INSURED) %>%
  na.omit
ggplot(insurance.data, aes(Obesity)) +
  geom_bar(aes(fill = Insured), position = "fill") +
  labs(y = "Proportion")

# usual go
usual <- data %>%
  select(Obesity = OBESITY2, Usual = USUALGO) %>%
  na.omit
ggplot(usual, aes(Usual)) +
  geom_bar(fill = "skyblue2") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(y = "Count")

ggplot(usual, aes(Usual)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(y = "Proportion") + 
  theme(axis.text.x = element_text(angle=45, hjust=1))

# types of insurance - count number of yes
# types of insurance - count the number of "Yes" responses per obesity group
count.yes <- function(var, label) {
  data %>%
    group_by(OBESITY2) %>%
    count(Response = .data[[var]]) %>%
    na.omit %>%
    filter(Response == "Yes") %>%
    ungroup() %>%
    select(Obesity = OBESITY2, !!label := n)
}

insurance.vars <- c(Work = "WORKCOM", Self = "SELFCOM", MedA = "MEDACOM",
                    MedB = "MEDBCOM", MA = "MACOM", VA = "VACOM", Other = "OTHERCOM")

insurance.type <- imap(insurance.vars, ~ count.yes(.x, .y)) %>%
  reduce(full_join, by = "Obesity")
insurance <- melt(insurance.type, id.vars = 'Obesity')

ggplot(insurance, aes(Obesity, value)) +
  geom_bar(stat = "identity", position = "fill", aes(fill=variable)) +
  labs(y = "Proportion") +
  scale_fill_discrete(name = "Insurance Type")

ggplot(insurance, aes(variable, value)) +
  geom_bar(stat = "identity", fill = "skyblue2") +
  labs(x = "Insurance Type", y = "Count") +
  scale_fill_discrete(name = "Obesity")

ggplot(insurance, aes(variable, value)) +
  geom_bar(stat = "identity", position = "fill", aes(fill=Obesity)) +
  labs(x = "Insurance Type", y = "Proportion") +
  scale_fill_discrete(name = "Obesity")

Interesting trends I noted from the graphs:

  • Obese respondents were the most likely to be insured through Medicaid (Medical Assistance)

Next, I looked at other health outcomes to see if there was any relationship to obesity.

## health outcomes
# overall health
health <- data %>%
  select(Obesity = OBESITY2, Health = HEALTH) %>%
  na.omit
ggplot(health, aes(Health)) +
  geom_bar(fill = "skyblue2") +
  labs(y = "Count")

ggplot(health, aes(Health)) +
  geom_bar(aes(fill = Obesity), position = "fill") +
  labs(y = "Proportion")

# dentist
dentist <- data %>%
  select(Obesity = OBESITY2, Dentist = DENTIST) %>%
  na.omit
ggplot(dentist, aes(Obesity)) +
  geom_bar(position = "fill", aes(fill = Dentist)) + 
  labs(y = "Proportion")

# smoking
smoke <- data %>%
  select(Obesity = OBESITY2, Smoking = SMOKHOME) %>%
  na.omit
ggplot(smoke, aes(Obesity)) +
  geom_bar(position = "fill", aes(fill = Smoking)) + 
  labs(y = "Proportion")

# asthma
asthma <- data %>%
  group_by(OBESITY2) %>%
  count(EVRASTH) %>%
  na.omit %>%
  rename(Obesity = OBESITY2, Asthma = EVRASTH, Count = n)
ggplot(asthma, aes(Obesity, Count)) +
  geom_bar(stat = "identity", position = "fill", aes(fill = Asthma)) +
  labs(y = "Proportion")

Interesting trends I noted:

  • Obese respondents were the most likely to rate their health as fair or poor
  • Obese respondents were the most likely to have not visited the dentist in the past year
  • Obese respondents were the most likely to have smoking in the household
  • Obese respondents were the most likely to have asthma

Machine Learning

After performing exploratory data analysis, I wanted to use machine learning techniques to determine the strongest predictors of obesity and build a prediction model. I selected relevant variables revolving around demographics, access to care, and health outcomes, since I believed those would have the most impact on obesity. I then collapsed the original Underweight, Normal, and Overweight categories into a single “Non-Obese” class. I used random forest and logistic regression (glm) for my models.

library(randomForest)
library(pROC)
library(data.table)

# select relevant data (demographics, access to care, health outcomes)
obesity.data <- data %>%
  select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
  filter(complete.cases(.)) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))

# random forest
obesity.rf <- randomForest(Obesity ~ ., data = obesity.data, importance = TRUE)
importance <- data.frame(obesity.rf$importance)
gini <- importance %>%
  rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseGini))
head(gini, n = 10)
##      variable   Non.Obese         Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1      Income 0.027296467 -0.0143836078          0.015618309         498.4078
## 2         Age 0.014508513  0.0041712781          0.011608246         419.3644
## 3      Health 0.016359415  0.0355829868          0.021734274         264.4766
## 4   Education 0.013260941 -0.0028532365          0.008745387         194.7179
## 5      County 0.001779015 -0.0001010282          0.001252389         186.8185
## 6     Marital 0.007474865 -0.0045213082          0.004115743         155.8901
## 7        Kids 0.007674388 -0.0049985880          0.004124509         137.7265
## 8      Adults 0.003490343 -0.0016439285          0.002048496         137.6220
## 9  Employment 0.007332163 -0.0041518782          0.004118659         112.4854
## 10       Race 0.008544940 -0.0009852724          0.005880289         108.4254

The top ten variables from random forest were:

  1. Income
  2. Age
  3. Health
  4. Education
  5. County
  6. Marital status
  7. Kids
  8. Adults
  9. Employment
  10. Race
obesity.rf.top <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data = obesity.data, importance = TRUE)

rf.pred <- predict(obesity.rf.top, obesity.data, type="prob")

# glm 
obesity.glm <- glm(Obesity ~ ., data = obesity.data, family = binomial(logit))

coef <- summary(obesity.glm)$coefficients
coef.sort <- as.data.frame(coef)
coef.sort <- setDT(coef.sort, keep.rownames = TRUE)[]
names(coef.sort) <- c("Variable", "Estimate","SE","tval","pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 20)
##                                          Variable     Estimate         SE      tval         pval
## 1                                      HealthGood  1.348370701 0.08241119 16.361499 3.601510e-60
## 2                                      HealthFair  1.477199284 0.10563910 13.983453 1.967091e-44
## 3                                 HealthVery Good  0.766727728 0.07836307  9.784299 1.315034e-22
## 4                                      HealthPoor  1.140379388 0.16231813  7.025582 2.131752e-12
## 5                                        AsthmaNo -0.379424988 0.06758017 -5.614443 1.971967e-08
## 6                                     (Intercept) -2.116327055 0.39040345 -5.420872 5.930907e-08
## 7                                             Age  0.008571241 0.00188828  4.539180 5.647344e-06
## 8                                       RaceBlack  0.348432798 0.07712853  4.517561 6.255611e-06
## 9                       DentistMore than one year  0.234693584 0.06585853  3.563602 3.658008e-04
## 10                              EmploymentRetired -0.352015534 0.10529787 -3.343045 8.286437e-04
## 11                                      RaceAsian -1.035357671 0.32503215 -3.185401 1.445537e-03
## 12                                         MedANo  0.353660840 0.13517001  2.616415 8.885846e-03
## 13                   MaritalLiving with a partner  0.324644453 0.13223552  2.455047 1.408661e-02
## 14                          EducationSome college  0.340596777 0.14734178  2.311610 2.079917e-02
## 15                         SexIdentSomething else -0.965556747 0.42137985 -2.291417 2.193933e-02
## 16                                      SexFemale -0.124530056 0.05533626 -2.250424 2.442204e-02
## 17 Income$13,800/$14,000 to under $15,500/$15,700  0.661988788 0.31370074  2.110256 3.483633e-02
## 18                UsualHospital outpatient clinic -0.277785405 0.13304027 -2.087980 3.679966e-02
## 19                               UsualOther place -0.429605057 0.20671431 -2.078255 3.768588e-02
## 20                                      InsuredNo -0.370044068 0.18370493 -2.014339 4.397392e-02
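Note that this table ranks individual factor levels (e.g. HealthGood) rather than whole variables, so a multi-level variable appears once per level. One alternative for ranking entire terms is a likelihood-ratio test per variable via drop1; a minimal sketch on synthetic data (the names g, x, y are illustrative, not from the survey):

```r
# drop1 refits the model with each whole term removed and reports a
# likelihood-ratio test, giving one p-value per variable, not per level
set.seed(1)
n <- 400
d <- data.frame(g = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
                x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * (d$g == "b") + 0.3 * d$x))
m <- glm(y ~ g + x, data = d, family = binomial(logit))
drop1(m, test = "LRT")  # one row each for "g" and "x"
```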

The top ten variables from glm were:

  1. Health status
  2. Asthma
  3. Age
  4. Race
  5. Dentist
  6. Employment
  7. Medicare Part A
  8. Marital status
  9. Education
  10. Sexual Identity
obesity.glm.top <- glm(Obesity ~ Health + Asthma + Age + Race + Dentist + Employment + MedA + Marital + Education + SexIdent, data = obesity.data, family = binomial(logit))
glm.pred <- predict(obesity.glm.top, obesity.data, type="response")

N = nrow(obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)
pred.outputs.glm <- vector(mode="numeric", length=N)
pred.outputs.rf <- vector(mode="numeric", length=N)
obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
    train <- filter(obesity.data, s != i)
    test <- filter(obesity.data, s == i)
    obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
    
    #GLM train/test
    glm <- glm(Obesity ~ Health + Asthma + Age + Race + Dentist + Employment + MedA + Marital + Education + SexIdent, data=train, family=binomial(logit))
    glm.pred.curr <- predict(glm, test, type="response")
    pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr

    #RF train/test
    rf <- randomForest(Obesity ~ Age + Income + Health + County + Education + Marital + Kids + Adults + Race + Employment, data=train, ntree=100)
    rf.pred.curr <- predict(rf, newdata=test, type="prob") 
    pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]

    offset <- offset + length(s[s==i])
}

roc(obesity.data$Obesity, glm.pred, ci = TRUE)
## 
## Call:
## roc.default(response = obesity.data$Obesity, predictor = glm.pred,     ci = TRUE)
## 
## Data: glm.pred in 5833 controls (obesity.data$Obesity Non-Obese) < 2266 cases (obesity.data$Obesity Obese).
## Area under the curve: 0.7076
## 95% CI: 0.6954-0.7199 (DeLong)
roc(obs.outputs, pred.outputs.glm, ci = TRUE)
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm,     ci = TRUE)
## 
## Data: pred.outputs.glm in 5833 controls (obs.outputs 1) < 2266 cases (obs.outputs 2).
## Area under the curve: 0.6986
## 95% CI: 0.6863-0.7109 (DeLong)
roc(obesity.data$Obesity, rf.pred[,1], ci = TRUE)
## 
## Call:
## roc.default(response = obesity.data$Obesity, predictor = rf.pred[,     1], ci = TRUE)
## 
## Data: rf.pred[, 1] in 5833 controls (obesity.data$Obesity Non-Obese) > 2266 cases (obesity.data$Obesity Obese).
## Area under the curve: 0.9986
## 95% CI: 0.9979-0.9993 (DeLong)
roc(obs.outputs, pred.outputs.rf, ci = TRUE)
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf,     ci = TRUE)
## 
## Data: pred.outputs.rf in 5833 controls (obs.outputs 1) < 2266 cases (obs.outputs 2).
## Area under the curve: 0.6772
## 95% CI: 0.6648-0.6897 (DeLong)
plot.roc(obesity.data$Obesity, glm.pred, col = "coral")
plot.roc(obs.outputs, pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(obesity.data$Obesity, rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(obs.outputs, pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright", 
       legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
       col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)

The AUCs for the models were:

  • GLM Training = 0.7076
  • GLM Cross-Validation = 0.6986
  • RF Training = 0.9986
  • RF Cross-Validation = 0.6772
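The near-perfect RF training AUC comes from re-predicting the same rows the forest was fit on, which is optimistic; the cross-validated value is the honest one. randomForest’s out-of-bag (OOB) predictions, obtained by calling predict() without newdata, give a similarly honest estimate without a separate loop. A minimal sketch on synthetic data (not the survey data):

```r
library(randomForest)
library(pROC)

set.seed(42)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# outcome only weakly related to x1, so an honest AUC should be modest
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 2) > 0, "Obese", "Non-Obese"))

rf <- randomForest(y ~ ., data = df)
resub <- predict(rf, df, type = "prob")[, "Obese"]  # re-predict training rows
oob <- predict(rf, type = "prob")[, "Obese"]        # out-of-bag predictions

auc(roc(df$y, resub))  # inflated, close to 1
auc(roc(df$y, oob))    # much lower, comparable to cross-validation
```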

Children

I also wanted to see if there was a difference between adults and children, so I separated the data by respondent type and performed the same analysis as before.

child.obesity.data <- data %>%
  filter(FLAGCOM == "Child") %>%
  select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
  filter(complete.cases(.)) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))

# random forest
child.obesity.rf <- randomForest(Obesity ~ ., data = child.obesity.data, importance = TRUE)
importance <- data.frame(child.obesity.rf$importance)
gini <- importance %>%
  rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseGini))
head(gini, n = 10)
##     variable    Non.Obese         Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1     Income 0.0153539296  0.0142592766         0.0151723661         87.11472
## 2        Age 0.0051132113  0.0217663766         0.0078159236         52.01621
## 3  Education 0.0063196199  0.0095553170         0.0068533396         34.58930
## 4     Health 0.0058235693  0.0205955937         0.0082377298         34.33234
## 5     County 0.0011954064  0.0067714310         0.0021030403         32.08490
## 6       Kids 0.0007308595  0.0018609799         0.0009195634         24.28021
## 7    Marital 0.0042233048 -0.0004837701         0.0034611150         23.18928
## 8       Race 0.0044912730  0.0021706454         0.0040926524         22.04800
## 9     Adults 0.0020377475 -0.0038687373         0.0010744458         19.76558
## 10    Asthma 0.0026255846  0.0070700737         0.0033648849         12.49906

The top ten variables from random forest for children were:

  1. Income
  2. Age
  3. Education
  4. Health
  5. County
  6. Kids
  7. Marital status of parents
  8. Race
  9. Adults
  10. Asthma
child.obesity.rf.top <- randomForest(Obesity ~ Income + Age + Education + County + Health + Kids + Marital + Race + Adults + Asthma, data = child.obesity.data, importance = TRUE)

child.rf.pred <- predict(child.obesity.rf.top, child.obesity.data, type="prob")

# glm 
child.obesity.glm <- glm(Obesity ~ ., data = child.obesity.data, family = binomial(logit))

coef <- summary(child.obesity.glm)$coefficients
coef.sort <- as.data.frame(coef)
coef.sort <- setDT(coef.sort, keep.rownames = TRUE)[]
names(coef.sort) <- c("Variable", "Estimate","SE","tval","pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 25)
##                                          Variable   Estimate         SE      tval         pval
## 1                                             Age -0.1605368 0.02260187 -7.102811 1.222450e-12
## 2                                      HealthGood  1.0320349 0.20800900  4.961492 6.995386e-07
## 3                                      HealthFair  1.7176299 0.39584298  4.339170 1.430220e-05
## 4                                        AsthmaNo -0.5403560 0.16907222 -3.196007 1.393438e-03
## 5                                       SexFemale -0.4658469 0.14852340 -3.136522 1.709646e-03
## 6                                 HealthVery Good  0.5122784 0.17298608  2.961385 3.062585e-03
## 7                    MaritalLiving with a partner  0.8144577 0.33829025  2.407571 1.605905e-02
## 8                 Income$75,000 to under $100,000 -1.6341018 0.69118576 -2.364201 1.806902e-02
## 9                          Income$250,000 or more -1.6450597 0.73899854 -2.226066 2.600974e-02
## 10               Income$150,000 to under $250,000 -1.4935607 0.69239247 -2.157101 3.099777e-02
## 11                                  CountyChester -0.5632715 0.28275805 -1.992062 4.636430e-02
## 12                                         SelfNo  0.2922436 0.15077956  1.938217 5.259669e-02
## 13 Income$29,300/$29,750 to under $31,200/$31,900 -1.6547405 0.86981284 -1.902410 5.711759e-02
## 14                              EmploymentRetired -3.1024600 1.64424819 -1.886856 5.917966e-02
## 15 Income$27,600/$28,000 to under $29,300/$29,750 -1.7516617 0.97228479 -1.801593 7.160943e-02
## 16 Income$55,000/$55,400 to under $60,000/$60,000 -1.5476525 0.87625497 -1.766213 7.736020e-02
## 17               Income$100,000 to under $150,000 -1.1224344 0.67372750 -1.666007 9.571211e-02
## 18                         SexIdentSomething else  3.1873632 1.91759685  1.662165 9.647960e-02
## 19                                      InsuredNo -1.0207388 0.64929723 -1.572067 1.159350e-01
## 20     Income$5,750/$5,850 to under $7,750/$7,850 -1.9483382 1.34704730 -1.446377 1.480715e-01
## 21 Income$39,100/$40,000 to under $41,400/$41,900 -1.4361522 1.02721566 -1.398102 1.620825e-01
## 22                               MaritalSeparated -0.7559577 0.54646623 -1.383357 1.665555e-01
## 23                                      RaceBlack  0.3134588 0.22681875  1.381979 1.669781e-01
## 24 Income$13,800/$14,000 to under $15,500/$15,700 -1.6891705 1.29546672 -1.303909 1.922647e-01
## 25       EmploymentFull-time student/Job training -1.7556245 1.36297975 -1.288078 1.977188e-01

The top ten variables from glm for children were:

  1. Age
  2. Health
  3. Asthma
  4. Sex
  5. Marital status of parents
  6. Income
  7. Self insurance
  8. Employment
  9. Sexual Identity
  10. Insured
child.obesity.glm.top <- glm(Obesity ~ Age + Health + Asthma + Sex + Marital + Income + Self + Employment + SexIdent + Insured, data = child.obesity.data, family = binomial(logit))
child.glm.pred <- predict(child.obesity.glm.top, child.obesity.data, type="response")

N = nrow(child.obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)
child.pred.outputs.glm <- vector(mode="numeric", length=N)
child.pred.outputs.rf <- vector(mode="numeric", length=N)
child.obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
    train <- filter(child.obesity.data, s != i)
    test <- filter(child.obesity.data, s == i)
    child.obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
    
    #GLM train/test
    glm <- glm(Obesity ~ Age + Health + Asthma + Sex + Marital + Income + Self + Employment + SexIdent + Insured, data=train, family=binomial(logit))
    glm.pred.curr <- predict(glm, test, type="response")
    child.pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr

    #RF train/test
    rf <- randomForest(Obesity ~ Income + Age + Education + County + Health + Kids + Marital + Race + Adults + Asthma, data=train, ntree=100)
    rf.pred.curr <- predict(rf, newdata=test, type="prob") 
    child.pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]

    offset <- offset + length(s[s==i])
}

roc(child.obesity.data$Obesity, child.glm.pred, ci = TRUE)
## 
## Call:
## roc.default(response = child.obesity.data$Obesity, predictor = child.glm.pred,     ci = TRUE)
## 
## Data: child.glm.pred in 1529 controls (child.obesity.data$Obesity Non-Obese) < 301 cases (child.obesity.data$Obesity Obese).
## Area under the curve: 0.7894
## 95% CI: 0.7619-0.817 (DeLong)
roc(child.obs.outputs, child.pred.outputs.glm, ci = TRUE)
## 
## Call:
## roc.default(response = child.obs.outputs, predictor = child.pred.outputs.glm,     ci = TRUE)
## 
## Data: child.pred.outputs.glm in 1529 controls (child.obs.outputs 1) < 301 cases (child.obs.outputs 2).
## Area under the curve: 0.7244
## 95% CI: 0.6921-0.7567 (DeLong)
roc(child.obesity.data$Obesity, child.rf.pred[,1], ci = TRUE)
## 
## Call:
## roc.default(response = child.obesity.data$Obesity, predictor = child.rf.pred[,     1], ci = TRUE)
## 
## Data: child.rf.pred[, 1] in 1529 controls (child.obesity.data$Obesity Non-Obese) > 301 cases (child.obesity.data$Obesity Obese).
## Area under the curve: 0.9986
## 95% CI: 0.9972-1 (DeLong)
roc(child.obs.outputs, child.pred.outputs.rf, ci = TRUE)
## 
## Call:
## roc.default(response = child.obs.outputs, predictor = child.pred.outputs.rf,     ci = TRUE)
## 
## Data: child.pred.outputs.rf in 1529 controls (child.obs.outputs 1) < 301 cases (child.obs.outputs 2).
## Area under the curve: 0.7066
## 95% CI: 0.6744-0.7387 (DeLong)
plot.roc(child.obesity.data$Obesity, child.glm.pred, col = "coral")
plot.roc(child.obs.outputs, child.pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(child.obesity.data$Obesity, child.rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(child.obs.outputs, child.pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright", 
       legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
       col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)

The AUCs for the models were:

  • GLM Training = 0.7894
  • GLM Cross-Validation = 0.7244
  • RF Training = 0.9986
  • RF Cross-Validation = 0.7066

Adults

adult.obesity.data <- data %>%
  filter(FLAGCOM == "Adult") %>%
  select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
  filter(complete.cases(.)) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
  mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))

# random forest
adult.obesity.rf <- randomForest(Obesity ~ ., data = adult.obesity.data, importance = TRUE)
importance <- data.frame(adult.obesity.rf$importance)
gini <- importance %>%
  rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseGini))
head(gini, n = 10)
##      variable     Non.Obese         Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1      Income  2.464178e-02 -1.644699e-02         0.0117284846        425.76153
## 2         Age  1.669662e-02 -5.324369e-03         0.0097822595        319.97189
## 3      Health  2.200059e-02  2.733470e-02         0.0236675918        216.18614
## 4   Education  1.301671e-02 -4.772839e-03         0.0074197266        162.07644
## 5      County -9.847389e-05 -1.491345e-03        -0.0005410109        155.51560
## 6     Marital  6.987936e-03 -2.049963e-03         0.0041536409        136.67958
## 7      Adults  3.374632e-03 -1.860507e-03         0.0017297277        118.63535
## 8        Kids  4.937940e-03 -3.111993e-03         0.0024077995        111.00460
## 9  Employment  1.037812e-02 -6.915022e-03         0.0049545151        101.45155
## 10       Race  8.885691e-03  3.781615e-05         0.0061056536         86.94754

The top ten variables from the random forest for adults were:

  1. Income
  2. Age
  3. Health
  4. Education
  5. County
  6. Marital
  7. Adults
  8. Kids
  9. Employment
  10. Race
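This ranking can also be visualized directly: the randomForest package ships a built-in importance plot (a quick sketch using the fitted adult.obesity.rf object from above):

```r
# Plot both importance measures (mean decrease in accuracy and in Gini)
# for the top 10 variables of the fitted adult model; the ordering
# matches the table above.
varImpPlot(adult.obesity.rf, n.var = 10,
           main = "Adult obesity: variable importance")
```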
adult.obesity.rf.top <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data = adult.obesity.data, importance = TRUE)

adult.rf.pred <- predict(adult.obesity.rf.top, adult.obesity.data, type="prob")

# glm 
adult.obesity.glm <- glm(Obesity ~ ., data = adult.obesity.data, family = binomial(logit))

coef <- summary(adult.obesity.glm)$coefficients
coef.sort <- as.data.frame(coef) %>%
  rownames_to_column("Variable")
names(coef.sort) <- c("Variable", "Estimate", "SE", "zval", "pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 20)
##                                          Variable   Estimate         SE      zval         pval
## 1                                      HealthGood  1.4461474 0.09636849 15.006434 6.663477e-51
## 2                                      HealthFair  1.5559778 0.11775071 13.214170 7.268390e-40
## 3                                 HealthVery Good  0.8451756 0.09358468  9.031133 1.699037e-19
## 4                                      HealthPoor  1.3095187 0.17318565  7.561358 3.988824e-14
## 5                                        AsthmaNo -0.3740692 0.07605725 -4.918258 8.731766e-07
## 6                                       RaceBlack  0.3692622 0.08338249  4.428534 9.487583e-06
## 7                                     (Intercept) -1.9024318 0.44751104 -4.251139 2.126857e-05
## 8                       DentistMore than one year  0.2438354 0.06855360  3.556857 3.753178e-04
## 9                                       RaceAsian -1.5160284 0.43284358 -3.502485 4.609389e-04
## 10                UsualHospital outpatient clinic -0.4058594 0.14570116 -2.785560 5.343529e-03
## 11 Income$13,800/$14,000 to under $15,500/$15,700  0.8820372 0.33223519  2.654858 7.934180e-03
## 12                Income$75,000 to under $100,000  0.6099591 0.25744866  2.369246 1.782441e-02
## 13                         SexIdentSomething else -0.9782676 0.44670217 -2.189977 2.852589e-02
## 14     Income$7,750/$7,850 to under $9,750/$9,900  0.5998652 0.28579896  2.098906 3.582518e-02
## 15                                         MedANo  0.2936879 0.14184805  2.070440 3.841113e-02
## 16                              EmploymentRetired -0.2329424 0.11289807 -2.063299 3.908424e-02
## 17                               UsualOther place -0.4215336 0.21192100 -1.989107 4.668936e-02
## 18 Income$31,200/$31,900 to under $35,400/$36,000  0.5270726 0.27853937  1.892273 5.845460e-02
## 19 Income$19,600/$20,000 to under $23,500/$23,750  0.4876175 0.27601337  1.766644 7.728779e-02
## 20                          EducationSome college  0.2679400 0.15418461  1.737787 8.224836e-02

The top ten variables from the GLM for adults were:

  1. Health
  2. Asthma
  3. Race
  4. Dentist
  5. Usual
  6. Marital
  7. Income
  8. Sexual identity
  9. Medicare Part A
  10. Employment
adult.obesity.glm.top <- glm(Obesity ~ Health + Asthma + Race + Dentist + Usual + Marital + Income + SexIdent + MedA + Employment, data = adult.obesity.data, family = binomial(logit))
adult.glm.pred <- predict(adult.obesity.glm.top, adult.obesity.data, type="response")

# 10-fold cross-validation setup
N = nrow(adult.obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)   # random fold assignment for each row
adult.pred.outputs.glm <- vector(mode="numeric", length=N)
adult.pred.outputs.rf <- vector(mode="numeric", length=N)
adult.obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
    train <- filter(adult.obesity.data, s != i)
    test <- filter(adult.obesity.data, s == i)
    adult.obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
    
    #GLM train/test
    glm <- glm(Obesity ~ Health + Asthma + Race + Dentist + Usual + Marital + Income + SexIdent + MedA + Employment, data=train, family=binomial(logit))
    glm.pred.curr <- predict(glm, test, type="response")
    adult.pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr

    #RF train/test
    rf <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data=train, ntree=100)
    rf.pred.curr <- predict(rf, newdata=test, type="prob") 
    adult.pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]

    offset <- offset + length(s[s==i])
}

roc(adult.obesity.data$Obesity, adult.glm.pred, ci = TRUE)
## 
## Call:
## roc.default(response = adult.obesity.data$Obesity, predictor = adult.glm.pred,     ci = TRUE)
## 
## Data: adult.glm.pred in 4304 controls (adult.obesity.data$Obesity Non-Obese) < 1965 cases (adult.obesity.data$Obesity Obese).
## Area under the curve: 0.6882
## 95% CI: 0.6744-0.7019 (DeLong)
roc(adult.obs.outputs, adult.pred.outputs.glm, ci = TRUE)
## 
## Call:
## roc.default(response = adult.obs.outputs, predictor = adult.pred.outputs.glm,     ci = TRUE)
## 
## Data: adult.pred.outputs.glm in 4304 controls (adult.obs.outputs 1) < 1965 cases (adult.obs.outputs 2).
## Area under the curve: 0.669
## 95% CI: 0.655-0.683 (DeLong)
roc(adult.obesity.data$Obesity, adult.rf.pred[,1], ci = TRUE)
## 
## Call:
## roc.default(response = adult.obesity.data$Obesity, predictor = adult.rf.pred[,     1], ci = TRUE)
## 
## Data: adult.rf.pred[, 1] in 4304 controls (adult.obesity.data$Obesity Non-Obese) > 1965 cases (adult.obesity.data$Obesity Obese).
## Area under the curve: 0.9992
## 95% CI: 0.9987-0.9997 (DeLong)
roc(adult.obs.outputs, adult.pred.outputs.rf, ci = TRUE)
## 
## Call:
## roc.default(response = adult.obs.outputs, predictor = adult.pred.outputs.rf,     ci = TRUE)
## 
## Data: adult.pred.outputs.rf in 4304 controls (adult.obs.outputs 1) < 1965 cases (adult.obs.outputs 2).
## Area under the curve: 0.6477
## 95% CI: 0.6335-0.6619 (DeLong)
plot.roc(adult.obesity.data$Obesity, adult.glm.pred, col = "coral")
plot.roc(adult.obs.outputs, adult.pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(adult.obesity.data$Obesity, adult.rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(adult.obs.outputs, adult.pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright", 
       legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
       col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)

The AUCs for the models were:

  • GLM Training = 0.6882
  • GLM Cross-Validation = 0.669
  • RF Training = 0.9992
  • RF Cross-Validation = 0.6477
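The gap between the training and cross-validated AUCs (especially for the random forest, 0.9992 vs. 0.6477) indicates the forest is overfitting the training data. Whether the two cross-validated curves actually differ could be checked with pROC's DeLong test (a sketch, reusing the cross-validation vectors from the loop above):

```r
# Compare the cross-validated GLM and RF ROC curves with DeLong's test
glm.roc <- roc(adult.obs.outputs, adult.pred.outputs.glm)
rf.roc  <- roc(adult.obs.outputs, adult.pred.outputs.rf)
roc.test(glm.roc, rf.roc, method = "delong")
```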

Conclusion

In conclusion, the prediction models were not as accurate as I would have liked, but they did uncover some interesting predictors of obesity.

To recap the top predictors for children and adults:

Children

Adults

  • Random forest: Income, Age, Health, Education, County, Marital, Adults, Kids, Employment, Race
  • GLM: Health, Asthma, Race, Dentist, Usual, Marital, Income, Sexual identity, Medicare Part A, Employment

Socioeconomic status and race have been shown to affect obesity, so I wasn’t surprised to see Income, Education, Employment, self-insured status (Self), County, and Race. Access to care is also important in influencing obesity, since those who have access to healthcare professionals and the ability to pay for services are more likely to lead healthier lives; this includes whether the respondent is insured (Insured) and where they usually go for care (Usual). Other health outcomes may be confounding factors: self-reported health status (Health), Asthma, and whether the respondent has been to the dentist in the past year (Dentist). Also, since only those who are 65 or older or disabled are eligible for Medicare Part A, MedA is likely a confounding health-outcome factor as well.

I did some additional research to try to explain the remaining variables. Age ranks highly as a predictor of obesity, and the current thinking is that once someone becomes obese it is very difficult to return to a normal weight category, so they will likely remain obese for the rest of their life. Sex ranks highly as a predictor in children: females are less likely to be obese, and studies have suggested this may be due to societal standards equating thinness with beauty. Women, and especially impressionable young girls, may undergo extreme diets and suffer from eating disorders to maintain this “ideal” weight. Household size (Kids and Adults) may influence obesity, since those growing up with more siblings and adult figures in their lives tend to play with their family members instead of being glued to electronics as so many people are today. Lastly, marital status (Marital) and sexual identity (SexIdent) may be sources of stress and anxiety that lead to obesity.

Identifying these predictors is only half the battle. Obesity rates will continue to rise unless healthcare policymakers take action and continually reevaluate their initiatives for effectiveness. Access to care remains an important factor, and it’s crucial that healthcare be available to everyone regardless of income or employment status. In addition, I think it’s important that individuals be treated as individuals, each with their own health history and emotional baggage, and that health programs be tailored to suit each person’s needs.